DASHI: Dataset shift analysis and characterization in python¶

David Fernández Narro, Pablo Ferri Borredá, Ángel Sanchez-García, Juan M. García-Gómez, Carlos Sáez¶

dfernar@upv.edu.es, pabferb2@upv.es, ansan12a@upv.es, juanmig@upv.es, carsaesi@upv.es

Library: dashi¶

Welcome to the dashi Python library! This notebook demonstrates the key features and functionality of the library, as well as an example to help you understand its usage and applications for supervised multi-temporal/source dataset shift characterization.

1 Introduction¶


1.1 What is dashi?¶

dashi is a Python library crafted to analyze and characterize temporal and multi-source dataset shifts. It equips users with powerful tools for both supervised and unsupervised evaluations, making it easier to detect, understand, and address changes in data distributions with confidence and precision.

Why is dashi important?¶

Dataset shifts—unexpected changes in data distribution over time or across sources—can significantly impact the performance of machine learning models. dashi helps users not only identify these shifts but also provides insights to mitigate their effects, ensuring robust and reliable models.

1.2 Key Features¶

Biomedical data repositories and proprietary biomedical research databases are expanding rapidly, both in terms of sample size and the diversity of collected variables. This growth is driven by the widespread adoption of data-sharing initiatives, advancements in technological infrastructures, and the continuous population of these repositories over extended periods (Gewin, 2016; Andreu-Perez et al., 2015).

However, this increased availability of data presents significant challenges. The integration of data from diverse sources over time introduces potential issues that can impede its reuse in research contexts, such as population studies or statistical and machine learning modeling. Differences in protocols, population characteristics, and unforeseen biases, whether introduced by systems or human error, can lead to temporal or multi-source dataset shifts (Quiñonero-Candela, 2009; Moreno-Torres et al., 2012). These shifts manifest as changes in statistical distributions, altering reference characteristics and potentially degrading model performance. Addressing these issues is particularly critical for ensuring robust and reliable predictive modeling and population health studies, as temporal shifts in electronic health records (EHRs) have been identified as a major concern (Sáez et al., 2020; Schlegel & Ficheur, 2017).

This variability underscores the importance of addressing Data Quality (DQ) as a critical factor in enabling the reliable reuse of biomedical data. By detecting, understanding, and mitigating dataset shifts, researchers can ensure the robustness and validity of their findings, even within the context of an evolving and diverse data landscape.

1.2.1 Supervised Characterization¶

  • Leverage Random Forest classifiers or regressors trained on batched data (temporal or multi-source) to analyze dataset shifts.
  • Evaluate how shifts in the data influence model performance and uncover potential areas of degradation.
  • Gain actionable insights to adapt your models to evolving data landscapes.

1.2.2 Unsupervised Characterization¶

  • Detect and interpret temporal dataset shifts without relying on labeled data by visualizing patterns of data variability.
  • Core capabilities include:
    • Estimating statistical distributions over time, capturing the essence of data changes (Sáez et al., 2015).
    • Projecting these distributions onto non-parametric statistical manifolds to reveal hidden patterns of temporal variability (Sáez & García-Gómez, 2018).
    • Visualizing latent trends and shifts, providing a deeper understanding of how data evolves (Sáez et al., 2016).
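As a conceptual illustration only (this is not dashi's implementation, which relies on the statistical-manifold methods cited above), the dissimilarity between the distributions of two temporal batches of a categorical variable can be quantified with the Jensen-Shannon distance. All data and function names here are invented for the sketch:

```python
from collections import Counter
import math

def js_distance(p, q):
    """Jensen-Shannon distance between two discrete distributions given as dicts."""
    support = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in support}

    def kl(a, b):
        # Kullback-Leibler divergence restricted to the support of a
        return sum(a[k] * math.log2(a[k] / b[k]) for k in support if a.get(k, 0.0) > 0)

    return math.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))

def to_distribution(values):
    """Empirical distribution of a list of categorical values."""
    counts = Counter(values)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

# Two hypothetical monthly batches of the same categorical variable
batch_jan = ['A', 'A', 'B', 'B', 'B', 'C']
batch_feb = ['A', 'B', 'C', 'C', 'C', 'C']

d = js_distance(to_distribution(batch_jan), to_distribution(batch_feb))
print(round(d, 3))  # 0 means identical distributions, 1 means fully disjoint
```

A matrix of such pairwise distances between batches is the kind of raw material that manifold-projection methods then turn into the variability visualizations described above.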

With dashi, users can confidently address the challenges of dynamic datasets, ensuring their models remain robust in real-world applications.

1.3 Installation¶

You can install dashi using pip:

pip install dashi

Or install from source:

git clone https://github.com/bdslab-upv/dashi
cd dashi
pip install .

2 Data pre-processing¶


2.1 Load the CSV input file¶

The first step is to read the CSV file containing the data for the analysis. To do so, the user can apply the read_csv function from the pandas library. An example of how to read the CSV file is shown next:

In [1]:
import os

import pandas as pd

path = r'C:\Users\David\Desktop\Datasets\datos_mexico\COVID'

# Change this to the final location of the dataset
dataset = pd.read_csv(os.path.join(path, 'SAMPLE_FULLCOVIDMEXICO.csv'), low_memory=False)
dataset = dataset.drop(columns=['FECHA_ACTUALIZACION', 'ID_REGISTRO', 'FECHA_SINTOMAS', 'FECHA_DEF', 'YEAR'])

This dataset corresponds to Mexico's COVID-19 data from 2020 to 2024. It is a public dataset available at https://www.gob.mx/salud/documentos/datos-abiertos-152127. In this example, we have selected a random subset of 500,000 patients from the original dataset. It should be mentioned that in 2024, the 'RESULTADO_LAB' variable was replaced by two new variables: 'RESULTADO_PCR' and 'RESULTADO_PCR_COINFECCION'. The variable 'CLASIFICACION_FINAL_FLU' was also introduced in 2024.

For the supervised characterization, the data must not contain NaN values, and the user should decide how to handle them. For this example, we simply drop every column that contains any NaN value:
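Other strategies are possible, such as dropping incomplete rows or imputing values. The following sketch illustrates the main options on a small hand-made DataFrame (the column names and values are invented for the example):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values
df = pd.DataFrame({
    'EDAD': [34, np.nan, 61],
    'SEXO': ['F', 'M', np.nan],
    'UCI': [0, 1, 0],
})

rows_dropped = df.dropna(axis=0, how='any')  # drop rows with any NaN
cols_dropped = df.dropna(axis=1, how='any')  # drop columns with any NaN (as done above)
imputed = df.fillna({'EDAD': df['EDAD'].median(),  # impute numerical with the median
                     'SEXO': 'UNKNOWN'})           # impute categorical with a sentinel

print(rows_dropped.shape, cols_dropped.shape, imputed.isna().sum().sum())
```

Which strategy is appropriate depends on how much data each option discards and on whether imputation could itself mask a dataset shift.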

In [2]:
data_without_nan = dataset.dropna(axis=1, how='any')
del dataset

Currently, you do not need to format data types before using the supervised characterization functions, unlike the unsupervised approach. A future goal is to adapt the format_data function used for the unsupervised characterization so it can be reused for supervised characterization preprocessing.

3 Data analysis: Supervised characterization¶


3.1 Multi-Batch model training and validation¶

The estimate_multibatch_models function automatically trains Random Forest-based models across multiple batches (temporal or multi-source) for both classification and regression tasks, and validates each trained model's performance on every other batch. It requires specifying one target variable (regression or classification) and at least one numerical or categorical input feature within the input DataFrame, as well as either a date variable (with the batching period indicated through the corresponding argument) or a source variable. The date variable must be a valid date, and the source variable categories must be specified as strings. It is also recommended that each batching group contain enough data to be statistically representative. The function takes the following arguments:

  • data: The input DataFrame containing both numerical and categorical features, as well as the target variables.
  • inputs_numerical_column_names: List of column names representing numerical input features, if applicable.
  • inputs_categorical_column_names: List of column names representing categorical input features, if applicable.
  • output_regression_column_name: Column name for the regression target variable, if applicable.
  • output_classification_column_name: Column name for the classification target variable, if applicable.
  • date_column_name: Column name containing date or time information for temporal batching, in string format. Mandatory for temporal dataset shift characterization.
  • source_column_name: Column name representing the source of the data. Mandatory for multi-source dataset shift characterization.
  • period: Period for batching the data ('month' or 'year') when using temporal batching.
  • learning_strategy: Defines the learning strategy: 'from_scratch' or 'cumulative'.
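Conceptually, temporal batching with period='year' amounts to grouping records on the year component of the date column. A toy sketch of this idea (hand-made data, not dashi's internal code):

```python
import pandas as pd

# Hypothetical records with an admission date, mimicking FECHA_INGRESO
toy = pd.DataFrame({
    'FECHA_INGRESO': pd.to_datetime(['2020-03-01', '2020-11-15', '2021-02-02']),
    'UCI': [0, 1, 0],
})

# One batch per year of the date column
batches = {year: group for year, group in toy.groupby(toy['FECHA_INGRESO'].dt.year)}

years_found = sorted(int(y) for y in batches)
print(years_found)  # -> [2020, 2021]
```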

The estimate_multibatch_models function returns a dictionary containing the calculated metrics for each batch and model combination.
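The two learning strategies differ in which batches each model is trained on: with 'from_scratch', every model sees only its own batch, while with 'cumulative', each model is trained on its batch plus all previous ones. This sketch of the batch-selection logic is illustrative only, not dashi's internal code:

```python
def training_batches(batch_labels, learning_strategy):
    """For each batch, return the list of batches a model would be trained on."""
    plans = {}
    for i, label in enumerate(batch_labels):
        if learning_strategy == 'from_scratch':
            plans[label] = [label]                     # only the current batch
        elif learning_strategy == 'cumulative':
            plans[label] = list(batch_labels[:i + 1])  # all batches up to and including this one
        else:
            raise ValueError(f'Unknown learning_strategy: {learning_strategy}')
    return plans

years = ['2020', '2021', '2022']
print(training_batches(years, 'from_scratch'))
print(training_batches(years, 'cumulative'))
```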

The regression metrics stored are:

  • 'MEAN_ABSOLUTE_ERROR'
  • 'MEAN_SQUARED_ERROR'
  • 'ROOT_MEAN_SQUARED_ERROR'
  • 'R_SQUARED'

The classification metrics stored are:

  • 'AUC_{class_identifier}'
  • 'AUC_MACRO'
  • 'LOGLOSS'
  • 'RECALL_{class_identifier}'
  • 'PRECISION_{class_identifier}'
  • 'F1-SCORE_{class_identifier}'
  • 'ACCURACY'
  • 'RECALL_MACRO'
  • 'RECALL_MICRO'
  • 'RECALL_WEIGHTED'
  • 'PRECISION_MACRO'
  • 'PRECISION_MICRO'
  • 'PRECISION_WEIGHTED'
  • 'F1-SCORE_MACRO'
  • 'F1-SCORE_MICRO'
  • 'F1-SCORE_WEIGHTED'
Example of characterization by temporal batches:¶
In [3]:
import dashi as ds

LABEL_NAME = 'CLASIFICACION_FINAL_COVID'

INPUT_CATEGORICAL_VARIABLES = ['ORIGEN', 'SECTOR', 'ENTIDAD_UM', 'SEXO', 'ENTIDAD_NAC', 'ENTIDAD_RES',
                               'MUNICIPIO_RES', 'TIPO_PACIENTE', 'INTUBADO',
                               'NEUMONIA', 'EDAD', 'NACIONALIDAD', 'EMBARAZO', 'HABLA_LENGUA_INDIG',
                               'INDIGENA', 'DIABETES', 'EPOC', 'ASMA', 'INMUSUPR', 'HIPERTENSION',
                               'OTRA_COM', 'CARDIOVASCULAR', 'OBESIDAD', 'RENAL_CRONICA', 'TABAQUISMO',
                               'OTRO_CASO', 'TOMA_MUESTRA_LAB', 'TOMA_MUESTRA_ANTIGENO',
                               'RESULTADO_ANTIGENO', 'MIGRANTE',
                               'PAIS_NACIONALIDAD', 'PAIS_ORIGEN', 'UCI']

metrics_temporal = ds.estimate_multibatch_models(data=data_without_nan,
                                                 inputs_categorical_column_names=INPUT_CATEGORICAL_VARIABLES,
                                                 output_classification_column_name=LABEL_NAME,
                                                 date_column_name='FECHA_INGRESO',
                                                 period='year',
                                                 learning_strategy='cumulative'
                                                 )
Example of characterization by multi-source batches:¶

The source column must be in string format. Otherwise, the subsequent representation of the results will not be correct.

In [4]:
INPUT_CATEGORICAL_VARIABLES = ['ORIGEN', 'ENTIDAD_UM', 'SEXO', 'ENTIDAD_NAC', 'ENTIDAD_RES',
                               'MUNICIPIO_RES', 'TIPO_PACIENTE', 'INTUBADO',
                               'NEUMONIA', 'EDAD', 'NACIONALIDAD', 'EMBARAZO', 'HABLA_LENGUA_INDIG',
                               'INDIGENA', 'DIABETES', 'EPOC', 'ASMA', 'INMUSUPR', 'HIPERTENSION',
                               'OTRA_COM', 'CARDIOVASCULAR', 'OBESIDAD', 'RENAL_CRONICA', 'TABAQUISMO',
                               'OTRO_CASO', 'TOMA_MUESTRA_LAB', 'TOMA_MUESTRA_ANTIGENO',
                               'RESULTADO_ANTIGENO', 'MIGRANTE',
                               'PAIS_NACIONALIDAD', 'PAIS_ORIGEN', 'UCI']

print(data_without_nan['SECTOR'].value_counts())
data_without_nan = data_without_nan[data_without_nan['SECTOR'] != 99]
SECTOR
12    256386
4     184534
9      32143
6      13783
3       5467
5       2205
8       1989
10      1010
15       900
11       638
7        407
13       312
2        112
1         81
99        33
Name: count, dtype: int64
In [5]:
import dashi as ds

# Cast the source column into str format
data_without_nan['SECTOR'] = data_without_nan['SECTOR'].apply(str)

metrics_source = ds.estimate_multibatch_models(data=data_without_nan,
                                               inputs_categorical_column_names=INPUT_CATEGORICAL_VARIABLES,
                                               output_classification_column_name=LABEL_NAME,
                                               source_column_name='SECTOR',
                                               period='year',
                                               learning_strategy='from_scratch')

4 Data Visualization: Supervised characterization¶


4.1 Plot models' performance metrics¶

The plot_multibatch_performance function displays a heatmap of the specified metric for the multiple batches of training and test models, using the metrics dictionary obtained from the estimate_multibatch_models function. It filters the dictionary based on the metric identifier and generates a heatmap where the x-axis represents the test batches, the y-axis represents the training batches, and the color scale indicates the values of the specified metric. The function takes the following arguments:

  • metrics: A dictionary where keys are tuples of (training_batch, test_batch, dataset_type), and values are the metric values for the corresponding combination. The dataset_type should be 'test' to include the metric in the heatmap.

  • metric_name: The name of the metric to visualize. The function will filter metrics based on this identifier and only plot those for the 'test' set. Regression metric names, when applicable:

    • 'MEAN_ABSOLUTE_ERROR'
    • 'MEAN_SQUARED_ERROR'
    • 'ROOT_MEAN_SQUARED_ERROR'
    • 'R_SQUARED'

    Classification metric names, when applicable:

    • 'AUC_{class_identifier}'
    • 'AUC_MACRO'
    • 'LOGLOSS'
    • 'RECALL_{class_identifier}'
    • 'PRECISION_{class_identifier}'
    • 'F1-SCORE_{class_identifier}'
    • 'ACCURACY'
    • 'RECALL_MACRO'
    • 'RECALL_MICRO'
    • 'RECALL_WEIGHTED'
    • 'PRECISION_MACRO'
    • 'PRECISION_MICRO'
    • 'PRECISION_WEIGHTED'
    • 'F1-SCORE_MACRO'
    • 'F1-SCORE_MICRO'
    • 'F1-SCORE_WEIGHTED'

This function generates and displays an interactive heatmap using Plotly, and does not return any value.
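The filtering and pivoting step behind such a heatmap can be pictured as follows. This is a hand-made toy dictionary, and it assumes the values are per-metric dicts keyed by metric name; the library's actual schema may differ:

```python
import pandas as pd

# Toy metrics keyed by (training_batch, test_batch, dataset_type); invented values
toy_metrics = {
    ('2020', '2020', 'test'): {'RECALL_MACRO': 0.91},
    ('2020', '2021', 'test'): {'RECALL_MACRO': 0.74},
    ('2021', '2020', 'test'): {'RECALL_MACRO': 0.78},
    ('2021', '2021', 'test'): {'RECALL_MACRO': 0.89},
    ('2020', '2020', 'train'): {'RECALL_MACRO': 0.99},  # excluded: not the 'test' split
}

# Keep only 'test' entries for the chosen metric, then pivot to a train-by-test matrix
rows = [
    {'train': tr, 'test': te, 'value': v['RECALL_MACRO']}
    for (tr, te, split), v in toy_metrics.items()
    if split == 'test'
]
heat = pd.DataFrame(rows).pivot(index='train', columns='test', values='value')
print(heat)
```

In such a matrix, a pronounced drop off the diagonal (a model trained on one batch scoring much lower on another) is the visual signature of a dataset shift.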

In [6]:
import plotly.io as pio

pio.renderers.default = 'notebook'

ds.plot_multibatch_performance(metrics=metrics_temporal,
                               metric_name='RECALL_MACRO')
In [7]:
pio.renderers.default = 'notebook'

ds.plot_multibatch_performance(metrics=metrics_temporal,
                               metric_name='F1-SCORE_MACRO')
In [8]:
pio.renderers.default = 'notebook'

ds.plot_multibatch_performance(metrics=metrics_source,
                               metric_name='RECALL_MACRO')
In [9]:
pio.renderers.default = 'notebook'

ds.plot_multibatch_performance(metrics=metrics_source,
                               metric_name='F1-SCORE_MACRO')

5 Metrics arrangement¶


If the user wishes to store a specific metric in a DataFrame, the arrange_performance_metrics function extracts a subset of metrics from the resulting dictionary into a pandas DataFrame. The function takes the following arguments:

  • metrics: A dictionary containing the calculated metrics for each batch and model combination resulting from the estimate_multibatch_models function.
  • metric_name: The name of the metric to be selected from the metrics dictionary. The available metrics are described above.

The arrange_performance_metrics function returns a DataFrame where the rows represent the combinations and the columns represent the metric values, with the index corrected for cumulative learning strategies.

In [10]:
metrics_temporal_frame = ds.arrange_performance_metrics(metrics=metrics_temporal,
                                                        metric_name='RECALL_MACRO')

6 Interpretation of the results¶


In real-world applications, machine learning models are frequently trained on specific datasets but deployed in environments where data distributions may shift over time due to factors such as temporal changes, varying data sources, or evolving user behaviors (Chen et al., 2024; Shao et al., 2024). These distribution shifts can lead to degraded or biased model predictions, undermining the reliability and fairness of deployed systems (Gama et al., 2014; Moreno-Torres et al., 2012). The dashi Python library addresses this challenge by enabling supervised detection of multi-temporal and multi-source dataset shifts. By analyzing how models trained on different data batches perform across new domains, dashi provides valuable insights into a model's generalization capabilities. Specifically, if a model exhibits decreased performance when evaluated on alternative data batches, it may indicate that the model is biased toward the training data or that there are significant differences between the training and evaluation data distributions. This diagnostic capability is crucial for identifying and mitigating potential biases, thereby enhancing model robustness and applicability in dynamic real-world settings (Sáez et al., 2024; Wang et al., 2024). dashi is thus a valuable tool for practitioners who need to keep their machine learning models performing well and fairly as data changes over time.

7 Summary of dashi supervised characterization functions¶


Table 1: Functions in dashi Python library

Input Object | Function | Output Generated
DataFrame | estimate_multibatch_models | Dict[str, float]
Dict[str, float] | plot_multibatch_performance | None
Dict[str, float] | arrange_performance_metrics | DataFrame

References¶


Andreu-Perez, J., Poon, C. C. Y., Merrifield, R. D., Wong, S. T. C., & Yang, G.-Z. (2015). Big Data for Health. IEEE Journal of Biomedical and Health Informatics, 19(4), 1193-1208. IEEE Journal of Biomedical and Health Informatics. https://doi.org/10.1109/JBHI.2015.2450362

Chen, M., Shen, L., Fu, H., Li, Z., Sun, J., & Liu, C. (2024). Calibration of Time-Series Forecasting: Detecting and Adapting Context-Driven Distribution Shift. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 341-352. https://doi.org/10.1145/3637528.3671926

Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M., & Bouchachia, A. (2014). A survey on concept drift adaptation. ACM Computing Surveys, 46(4), 1-37. https://doi.org/10.1145/2523813

Gewin, V. (2016). Data sharing: An open mind on open data. Nature, 529(7584), 117-119. https://doi.org/10.1038/nj7584-117a

Moreno-Torres, J. G., Raeder, T., Alaiz-Rodríguez, R., Chawla, N. V., & Herrera, F. (2012). A unifying view on dataset shift in classification. Pattern Recognition, 45(1), 521-530. https://doi.org/10.1016/j.patcog.2011.06.019

Quiñonero-Candela, J. (Ed.). (2009). Dataset shift in machine learning. MIT Press.

Sáez, C., Ferri, P., & García-Gómez, J. M. (2024). Resilient Artificial Intelligence in Health: Synthesis and Research Agenda Toward Next-Generation Trustworthy Clinical Decision Support. Journal of Medical Internet Research, 26(1), e50295. https://doi.org/10.2196/50295

Sáez, C., & García-Gómez, J. M. (2018). Kinematics of Big Biomedical Data to characterize temporal variability and seasonality of data repositories: Functional Data Analysis of data temporal evolution over non-parametric statistical manifolds. International Journal of Medical Informatics, 119, 109-124. https://doi.org/10.1016/j.ijmedinf.2018.09.015

Sáez, C., Gutiérrez-Sacristán, A., Kohane, I., García-Gómez, J. M., & Avillach, P. (2020). EHRtemporalVariability: Delineating temporal data-set shifts in electronic health records. GigaScience, 9(8), giaa079. https://doi.org/10.1093/gigascience/giaa079

Sáez, C., Rodrigues, P. P., Gama, J., Robles, M., & García-Gómez, J. M. (2015). Probabilistic change detection and visualization methods for the assessment of temporal stability in biomedical data quality. Data Mining and Knowledge Discovery, 29(4), 950-975. https://doi.org/10.1007/s10618-014-0378-6

Sáez, C., Zurriaga, O., Pérez-Panadés, J., Melchor, I., Robles, M., & García-Gómez, J. M. (2016). Applying probabilistic temporal and multisite data quality control methods to a public health mortality registry in Spain: A systematic approach to quality control of repositories. Journal of the American Medical Informatics Association, 23(6), 1085-1095. https://doi.org/10.1093/jamia/ocw010

Schlegel, D. R., & Ficheur, G. (2017). Secondary Use of Patient Data: Review of the Literature Published in 2016. Yearbook of Medical Informatics, 26(1), 68-71. https://doi.org/10.15265/IY-2017-032

Shao, M., Li, D., Zhao, C., Wu, X., Lin, Y., & Tian, Q. (2024). Supervised Algorithmic Fairness in Distribution Shifts: A Survey (arXiv:2402.01327). arXiv. https://doi.org/10.48550/arXiv.2402.01327

Wang, Z., Bühlmann, P., & Guo, Z. (2024). Distributionally Robust Machine Learning with Multi-source Data (arXiv:2309.02211). arXiv. https://doi.org/10.48550/arXiv.2309.02211